{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas和数据库查询语言SQL的对比\n",
    "\n",
    "* Pandas：Python最流行的数据处理与数据分析的类库\n",
    "* SQL：结构化查询语言，用于对MySQL、Oracle等关系型数据库的增删改查\n",
    "\n",
    "两者都是对“表格型”数据的操作和查询，所以很多语法都能对应起来\n",
    "\n",
    "对比列表：\n",
    "1. SELECT数据查询\n",
    "2. WHERE按条件查询\n",
    "3. in和not in的条件查询\n",
    "4. groupby分组统计\n",
    "5. JOIN数据关联\n",
    "6. UNION数据合并\n",
    "7. Order Limit先排序后分页\n",
    "8. 取每个分组group的top n\n",
    "9. UPDATE数据更新\n",
    "10. DELETE删除数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0. 读取泰坦尼克数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"./datas/titanic/titanic_train.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. SELECT数据查询"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT PassengerId, Sex, Age, Survived\n",
    "    FROM titanic\n",
    "    LIMIT 5;\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pandas\n",
    "df[[\"PassengerId\", \"Sex\", \"Age\", \"Survived\"]].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "df.head(5)类似select * from table limit 5，查询所有的字段"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. WHERE按条件查询"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT *\n",
    "    FROM titanic\n",
    "    where Sex='male' and Age>=20.0 and Age<=40.0\n",
    "    LIMIT 5;\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 使用括号的方式，级联多个条件|\n",
    "condition = (df[\"Sex\"]==\"male\") & (df[\"Age\"]>=20.0) & (df[\"Age\"]<=40.0)\n",
    "condition.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[condition].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. in和not in的条件查询"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Pclass\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT *\n",
    "    FROM titanic\n",
    "    where Pclass in (1,2)\n",
    "    LIMIT 5;\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# in \n",
    "df[df[\"Pclass\"].isin((1,2))].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# not in \n",
    "df[~df[\"Pclass\"].isin((1,2))].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. groupby分组统计"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.1 单个列的聚合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT \n",
    "        -- 分性别的存活人数\n",
    "        sum(Survived),\n",
    "        -- 分性别的平均年龄\n",
    "        mean(Age)\n",
    "        -- 分性别的平均票价\n",
    "        mean(Fare)\n",
    "    FROM titanic\n",
    "    group by Sex\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(\"Sex\").agg({\"Survived\":np.sum, \"Age\":np.mean, \"Fare\":np.mean})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2 多个列的聚合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT \n",
    "        -- 不同存活和性别分组的，平均年龄\n",
    "        mean(Age)\n",
    "        -- 不同存活和性别分组的，平均票价\n",
    "        mean(Fare)\n",
    "    FROM titanic\n",
    "    group by Survived, Sex\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby([\"Survived\", \"Sex\"]).agg({\"Age\":np.mean, \"Fare\":np.mean})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. JOIN数据关联"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 电影评分数据集，评分表\n",
    "df_rating = pd.read_csv(\"./datas/ml-latest-small/ratings.csv\")\n",
    "df_rating.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 电影评分数据集，电影信息表\n",
    "df_movies = pd.read_csv(\"./datas/ml-latest-small/movies.csv\")\n",
    "df_movies.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT *\n",
    "    FROM \n",
    "        rating join movies \n",
    "        on(rating.movieId=movies.movieId)\n",
    "    limit 5\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merged = pd.merge(left=df_rating, right=df_movies, on=\"movieId\")\n",
    "df_merged.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. UNION数据合并"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],\n",
    "                    'rank': range(1, 4)}) \n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],\n",
    "                    'rank': [1, 4, 5]})\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT city, rank\n",
    "    FROM df1\n",
    "    \n",
    "    UNION ALL\n",
    "    \n",
    "    SELECT city, rank\n",
    "    FROM df2;\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas\n",
    "pd.concat([df1, df2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Order Limit先排序后分页"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    SELECT *\n",
    "    from titanic\n",
    "    order by Fare\n",
    "    limit 5\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sort_values(\"Fare\", ascending=False).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8. 取每个分组group的top n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# MYSQL不支持\n",
    "# Oracle有ROW_NUMBER语法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 按（Survived，Sex）分组，取Age的TOP 2\n",
    "df.groupby([\"Survived\", \"Sex\"]).apply(\n",
    "    lambda df:df.sort_values(\"Age\", ascending=False).head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9. UPDATE数据更新"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    UPDATE titanic\n",
    "    set Age=0\n",
    "    where Age is null\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "condition = df[\"Age\"].isna()\n",
    "condition.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[condition] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Age\"].isna().value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10. DELETE删除数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL：\n",
    "sql = \"\"\"\n",
    "    DELETE FROM titanic\n",
    "    where Age=0\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_new = df[df[\"Age\"]!=0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_new[df_new[\"Age\"]==0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
